Introduction to Data Science with R

Session 5b: Tidying data

Ina Bornkessel-Schlesewsky

November 29, 2023

Recall

  • import data (into R)
  • tidy data
    • bring it into a consistent format that can be used for multiple purposes (each column = variable; each row = observation)
    • lets you focus on understanding the data rather than which format you need
  • transform data
    • e.g. focus on observations of interest (such as those from a particular location), create new variables (such as speed from distance and time), compute summary statistics
  • visualise data
    • essential for understanding
  • model data
    • use (statistical) models to answer your questions about the data
  • communicate insights

Note how you already know the basics of a number of these steps!

Data structure

  • Most datasets are made up of rows and columns
  • Many ways to structure the same data (see figure)
  • Datasets are collections of values (either numbers or strings); these belong to a variable and an observation
  • “A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units.”
  • “An observation contains all values measured on the same unit (like a person, or a day, or a race) across attributes.” (Wickham, 2014)

figure from Wickham (2014)

Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(10), 1 - 23. doi:http://dx.doi.org/10.18637/jss.v059.i10

Data structure

  • The same dataset but with observations in rows and variables in columns

figure from Wickham (2014)

Tidy data

Characteristics of tidy data:

  • Each variable has its own column
  • Each observation has its own row
  • Each value has its own cell

figure from R4DS

Different data formats

  • The three data frames on the next slide show example data provided in the tidyr package, which is part of the tidyverse.

“all display the number of TB cases documented by the World Health Organization in Afghanistan, Brazil, and China between 1999 and 2000”

(from ?tidyr::table1)

  • Which of these tables is tidy?

If you want to inspect the data yourself, you can access them via table1, table2 and table3.

Different data formats

[1] "table1"
# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583
[1] "table3"
# A tibble: 6 × 3
  country      year rate             
  <chr>       <dbl> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583
[1] "table2"
# A tibble: 6 × 4
  country      year type           count
  <chr>       <dbl> <chr>          <dbl>
1 Afghanistan  1999 cases            745
2 Afghanistan  1999 population  19987071
3 Afghanistan  2000 cases           2666
4 Afghanistan  2000 population  20595360
5 Brazil       1999 cases          37737
6 Brazil       1999 population 172006362

Different data formats


  • If you said table1, you were right!


  • Another possibility is for data to be spread across two data frames
  • See the next slide for an example (table4a and table4b from tidyr)

Different data formats

[1] "table4a"
# A tibble: 3 × 3
  country     `1999` `2000`
  <chr>        <dbl>  <dbl>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766
[1] "table4b"
# A tibble: 3 × 3
  country         `1999`     `2000`
  <chr>            <dbl>      <dbl>
1 Afghanistan   19987071   20595360
2 Brazil       172006362  174504898
3 China       1272915272 1280428583

Advantages to working with tidy data


  1. Having one consistent format for data makes it easier to learn the tools required for analysis (which can have a certain uniformity). The tidyverse packages, for example, are designed to work with tidy data (who would have thought! 😂)

  2. It is advantageous for variables to be placed in columns because this caters to R’s vectorised nature. (Most R-functions work with vectors of values.)

Data are often untidy

Common problems

  1. Variables are spread across multiple columns
  2. Observations are spread across multiple rows

The solution

  • functions pivot_longer() and pivot_wider() in tidyr!

Note: this doesn’t mean that non-tidy data are “bad”. There can be many reasons for why a dataset is in a non-tidy format, e.g. ease of data entry if this is being done manually.

Pivot to longer

Common problem: column names are values of a variable rather than variables

Example: table4a

# A tibble: 3 × 3
  country     `1999` `2000`
  <chr>        <dbl>  <dbl>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766

Solution: pivot these columns to new variables, rendering the dataset longer

We need:

  • the columns with values as names (1999 and 2000)
  • the name of the variable to move the column names to (year)
  • the name of the variable to move the column values to (cases)

Pivot to longer

table4a  |>  
  pivot_longer(c(`1999`,`2000`), names_to = "year", values_to = "cases")
# A tibble: 6 × 3
  country     year   cases
  <chr>       <chr>  <dbl>
1 Afghanistan 1999     745
2 Afghanistan 2000    2666
3 Brazil      1999   37737
4 Brazil      2000   80488
5 China       1999  212258
6 China       2000  213766

Pivot to longer

from R4DS

Exercise: try doing the same thing with table4b!

Excursus 1: joining tables

We can easily join the longer versions of table4a and table4b using left_join() (more on joining operations later):

tidy4a <- table4a %>% 
  pivot_longer(c(`1999`, `2000`), 
               names_to = "year", 
               values_to = "cases")

tidy4b <- table4b %>% 
  pivot_longer(c(`1999`, `2000`), 
               names_to = "year", 
               values_to = "population")

left_join(tidy4a, tidy4b)
# A tibble: 6 × 4
  country     year   cases population
  <chr>       <chr>  <dbl>      <dbl>
1 Afghanistan 1999     745   19987071
2 Afghanistan 2000    2666   20595360
3 Brazil      1999   37737  172006362
4 Brazil      2000   80488  174504898
5 China       1999  212258 1272915272
6 China       2000  213766 1280428583

Excursus 2: additional tools for cleaning data

  • the {janitor} package includes a number of useful functions for cleaning data
  • one of these is clean_names(), which cleans up problematic variable names (e.g. names with spaces, starting with a digit etc.)
table4a |> 
  janitor::clean_names()
# A tibble: 3 × 3
  country      x1999  x2000
  <chr>        <dbl>  <dbl>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766

Pivot to wider

  • pivot_wider() is the counterpart of pivot_longer() which you need when observations are spread across multiple rows such as in table2
  • here, the table needs to be made wider
# A tibble: 6 × 4
  country      year type           count
  <chr>       <dbl> <chr>          <dbl>
1 Afghanistan  1999 cases            745
2 Afghanistan  1999 population  19987071
3 Afghanistan  2000 cases           2666
4 Afghanistan  2000 population  20595360
5 Brazil       1999 cases          37737
6 Brazil       1999 population 172006362

Pivot to wider

To tidy table2 we need:

  • the column to take variables from (type)
  • the column to take values from (count)
table2 |> 
  pivot_wider(names_from = type, values_from = count)
# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

Pivot to wider

from R4DS

What’s up with table3?

# A tibble: 6 × 3
  country      year rate             
  <chr>       <dbl> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583

Multiple variables in one column

  • in table3, the rate column contains both cases and population
  • to deal with this problem, we can use the separate() function
  • it allows us to easily split a column according to a delimiting character (here, the “/”)
  • note how separate is clever enough to correctly guess the delimiting character – it looks for a non-alphanumeric character by default (to specify it manually, use sep = "/")
table3 |> 
  separate(rate, into = c("cases","population"))
# A tibble: 6 × 4
  country      year cases  population
  <chr>       <dbl> <chr>  <chr>     
1 Afghanistan  1999 745    19987071  
2 Afghanistan  2000 2666   20595360  
3 Brazil       1999 37737  172006362 
4 Brazil       2000 80488  174504898 
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

Multiple variables in one column

  • by default, separate retains the original column type (character in this case)
  • we can ask it to try to convert to a more suitable type using the convert parameter
table3 |> 
  separate(rate, into = c("cases","population"), convert = TRUE)
# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <int>      <int>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

Resources